Hi! I am a third-year PhD student at the Language Technologies Institute (LTI), Carnegie Mellon University (CMU), advised by Prof. Chenyan Xiong. I was fortunate to work with Dr. Scott Yih during my internship at Meta (2024). My primary research interests are:
- Intelligent and efficient LLM scaling with novel pretraining data curation and synthesis methods.
- Data valuation and influence attribution to better capture the impact of LLM training data.
When I am not doing research, I like to work out, play guitar, and watch movies.
Updates:
- October 2025: Check out our web recycling paper: RePro: Training Language Models to Faithfully Recycle the Web for Pretraining ✨
- June 2025: Check out our group-level data selection paper: Group-Level Data Selection for Efficient Pretraining at NeurIPS 2025 (poster) ✨
- June 2024: Check out our model-aware data selection paper: MATES🧑‍🤝‍🧑: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models at NeurIPS 2024 (poster) ✨
- December 2023: Check out our benchmarking LLMs paper: An In-depth Look at Gemini's Language Abilities ✨
- August 2023: Began my PhD at CMU 💪
- May 2023: Check out our generic retrieval augmentation paper: Augmentation-Adapted Retriever Improves Generalization of Language Models as Generic Plug-In at ACL 2023 (oral presentation) ✨
- August 2022: Check out our automatic prompting paper: Automatic Label Sequence Generation for Prompting Sequence-to-sequence Models at COLING 2022 (oral presentation) ✨